A novel method for predicting the spatial-arrangement
topology of an amino acid sequence using free energy
combined with secondary structural information

BACKGROUND OF THE INVENTION 

FIELD OF THE INVENTION 
This invention is concerned with a novel algorithm for
determining the spatial-arrangement topology of a protein from
its secondary structure. The invention is an essential aid in
finding the topology and ultimately the three dimensional (3D)
structure of proteins whose structures are unknown. The
invention also employs the algorithm to estimate the structural
stability of a protein with a given secondary structure and
sequence using a global entropy evaluation method combined with
thermodynamic parameters obtained from experimental data of
binding free energy (FE). The method also predicts the dominant
folding pathway of a protein as a result of using this global
entropy evaluation method.



Description of the Related Art 
Determination of the three dimensional structure of a
protein remains a very difficult process (1-5). The most
successful approach is x-ray crystallography, which involves
fitting diffraction data obtained at very large government-funded
synchrotron facilities (1). Although academic professionals
are able to apply for government funding to measure and analyze
protein structures at such facilities, the costs of maintaining
and operating the facilities that provide the beam time for
these experiments are prohibitive for commercial enterprises.
Yet biology-related commercial enterprises need this
information, particularly pharmaceutical companies, where new
drugs are always under development.

Furthermore, protein structures obtained by x-ray
crystallography require skilled techniques to express and
crystallize a given protein before such a measurement can be
made (7,8). It remains questionable whether all proteins can
be crystallized and whether the crystalline structures fully
represent the in vivo features of many biologically relevant
proteins. Whereas many enzymes remain active even in this
crystalline geometry (9), the true dynamics of these structures
and the range of conformations can only be inferred from the
x-ray data because the protein structures are rigidly locked in
a crystal. The in vivo structures of protein subunits are even
more difficult to assess as crystals.

A second approach is NMR spectroscopy (10,11). NMR
spectroscopy is cost efficient for a company to carry out.
However, this technique is often fraught with difficulties due
to the time resolution of NMR experiments, the effects of
solvent exchange and other complex coupling effects (7,8). In
addition, the same problems that hamper x-ray crystallography
research (protein expression, isolation, and
characterization) also render this approach costly.

The easiest information to obtain accurately with NMR
spectroscopy is the protein secondary structure (11). However,
the protein secondary structure carries insufficient
information to unambiguously identify the topology of a given
protein (12).

The most important topological information gained by NMR
experiments is the set of nuclear Overhauser effect (NOE)
constraints (10-12, see also USPTO 6,512,997). One must first
obtain many unambiguous NOE constraints to obtain a successful
prediction. However, many proteins have highly ambiguous NOE
constraints, or the NOE signals are too weak and broad banded to
assign properly. In such cases, the protein structure cannot be
resolved by NMR and the only remaining option is to turn to x-ray
crystallography.

A third approach is protein threading (13-20, see also
USPTO numbers 6,512,981; 5,878,373; 5,884,230 and 6,377,893).
However, many proteins still have less than 25% homology with
known protein structures in the protein data bank (PDB). To find
a plausible template structure for a protein of 25% homology,
considerably more information is needed to ensure the accuracy
of the prediction (13,17), and there is no objective method for
deducing which structures make acceptable threads.

A remaining option is to carry out a molecular dynamics
(MD) simulation (21-26). Currently, molecular dynamics
simulations can be carried out on short peptide sequences (21);
however, the time frame for a full protein refolding experiment
remains intractable because of the long calculation times
required on even the fastest computers (thousands of years
even on a parallel processor supercomputer to achieve one ms
of biological simulation time). Moreover, the uncertainties
and ambiguities of even the state of the art MD simulation
programs render whatever conclusions can be drawn from such a
long simulation questionable (21,23-26).

Combinatorial folding models of secondary structure
alone (27,28) yield an intractable number of structural
topologies to test in an MD simulation in explicit water (15,26).
If the correct topology can be obtained, the computational cost
of an MD simulation is drastically reduced and the confidence
level of the predictions improved to a root mean square (RMS)
deviation of no more than 3 Å (29).

What is needed is an intermediate, cost-effective and
objective approach that can infer the topology without having
to wait several millennia for the answer to be produced,
applying for large grants to budget synchrotron machine time,
spending long hours in the lab searching for ways to isolate
proteins, or utilizing subjective methodologies to infer the
protein structure. The topology indicates how the secondary
structure of a protein is arranged spatially and is the main
juncture between the secondary structure and the full 3D
structure. The topology (spatial arrangement) cannot be
obtained from the three-state secondary structure alone.

The invention is a semi-dynamic thermodynamic model of
protein folding that we developed from RNA research (30) to
account for the entropy of folding. Once the topology is known,
a protein can be tested for its 3D structure with only a small
fraction of the computer simulation time required for a complete
protein refolding MD simulation. The invention is intended to
aid the NMR and x-ray crystallographer in finding the 3D
structure of an unknown protein based upon partially determined
structural information, specifically the protein secondary
structure.

The importance of gaining a foothold on protein topology
cannot be emphasized enough. First, the experimental conditions
that complicate the NMR experiments on proteins are generally
the norm. Highly flexible proteins may have marginally stable
secondary structures that make their structures difficult to
determine experimentally with high precision by NMR. Second,
functional proteins are dynamic entities, not static crystals
(31,32). For regions of structure that exhibit a high degree
of flexibility and polar regions where there is rapid solvent
exchange, NMR spectroscopy is limited by its time resolution
(10,11). X-ray crystallography can obtain the structure of a
protein that can be crystallized; however, the overall dynamics
of the protein in solution are less clear. Topology prediction
offers an independent tool to guide the structural
determination and improve our understanding of the physics of
protein structure and folding dynamics.

The folding model considers the direction in which
biological proteins are synthesized and transported through the
cell as a basis for considering the step-by-step thermodynamics
of folding.

SUMMARY OF THE INVENTION 

This invention is a method implemented as a program for
estimating the topology of a protein based on the combined
information of a global entropy evaluation model and local
thermodynamic potentials that express hydrophobic, polar and
electrostatic interactions as well as other corrections
associated with size, shape or chemical properties. Parameters
and models for these local thermodynamic potentials can be
obtained from either theoretical sources or experimentally
obtained data. The local interactions are modeled to help align
individual protein secondary structure elements. Using a given
amino-acid sequence and an obtained secondary structure
estimate, the method is used to predict the best topology for
the protein and to determine the dominant global folding
kinetics of a protein in a biologically relevant fashion, as
described by the order and change in the protein's topology
during optimization of the free energy, where the synthesis of
a protein is expected to proceed from the amino-terminal
(N-terminal) end toward the carboxy-terminal (C-terminal) end.

More concretely, the invention of this application
relates to a method to predict the topology of the spatial
arrangement of an amino acid sequence using an entropy
evaluation model that takes into account the global
contributions of entropy to the folding of a biopolymer (herein
referred to by the name cross linking entropy (CLE) and
described in the literature) combined with other thermodynamic
potentials as a protein-folding model.

Further, the invention of this application may comprise
the following steps:

A. inputting an amino acid sequence of a protein,

B. preparing information on the secondary structure of
the said amino acid sequence by way of at least one
theoretical or experimental estimate,

C. applying the CLE method to the said amino acid
sequence and secondary structure information to
evaluate the free energy of a combinatorial number
of β-strand and α-helix arrangements as rapidly as
polynomial time: c(n - 1)(n + 1), wherein c is a constant
and n is the number of secondary structure elements
found in the said amino acid sequence input in step A
and prepared in step B,

D. applying the CLE method in conjunction with other
thermodynamic potentials that approximate
hydrophobic, electrostatic and polar interactions
(but not limited to the thermodynamic potentials
stated herein) in a thermodynamic calculation to
account for both short and long range folding
interactions and predict a minimum free energy and
corresponding topology of the said amino acid
sequence,

E. applying the CLE method to predict the global folding
kinetics of the said amino acid sequence, and

F. storing the information in a data file or in other
form of digital memory.

In the invention of this application, the cross linking
entropy (CLE), which is an entropy evaluation model that takes
into account the global effects of entropy in the folding of
a biopolymer, can be used to evaluate the entropy loss of a
protein due to folding into a particular topology given a known
or estimated secondary structure.

Further, the invention of this application relates to
the above-mentioned methods, in which loss of biological
activity of the protein can be further predicted.

Further, in the invention of this application, an initial
estimate of the secondary structure can be obtained from a
theoretical source, an experimental source such as an NMR
experiment or x-ray crystallography, or both.

The invention of this application further relates to the
above-mentioned method, in which the theoretical estimate can
be further supplemented with sequence alignment to find regions
in which conserved segments remain essentially unchanged by
differences in the aligned sequences.

In the invention of this application, the amino acid
sequence and secondary structure information can be used to
evaluate the free energy of a combinatorial number of β-strand
and α-helix arrangements as rapidly as polynomial time:
c(n - 1)(n + 1), wherein c is a constant and n is the number of
secondary structure elements found in the said amino acid
sequence.

Further, the invention of this application also relates
to a method to predict the topology of the spatial
arrangement of an amino acid sequence comprising the following
steps:

A. inputting an amino acid sequence of a protein,

B. preparing information on the secondary structure
of the said amino acid sequence by way of at least
one theoretical or experimental estimate,

E. applying the CLE method to approximate the global
folding kinetics of the said amino acid sequence,

G. applying the CLE method to the said amino acid
sequence and secondary structure information to
reduce the combinatorial number of β-strand and
α-helix arrangements to a computationally
manageable number, and

H. applying the CLE method to optimize the free energy
to find the most thermodynamically favorable
topology for the said amino acid sequence,
wherein the global free energy (FE) contribution from the
CLE between two distinct amino acid residues, herein labeled
i and j, is calculated by equation (1):

$$\Delta G_{ij} = -T\Delta S_{ij} = \frac{\gamma k_B T}{\xi}\left\{\ln\left(\frac{2\gamma\xi\,\Delta N_{ij}}{3\lambda_{ij}^{2}}\right) - \left(1 - \frac{3\lambda_{ij}^{2}}{2\gamma\xi\,\Delta N_{ij}}\right)\right\} \qquad (1)$$

wherein i and j represent the indices of two distinct
residues in the said amino acid sequence, and j > i,
$\Delta N_{ij} = j - i + 1$ expresses the number of residues
separating i and j, $\Delta G_{ij}$ is the difference in the free
energy contribution to the CLE from residues i and j
transitioning from the denatured (random flight) state to
the native state, $\Delta S_{ij}$ is the corresponding entropy loss,
ξ is the persistence length, γ is a dimensionless weight
parameter describing the self-avoiding properties of a
polymer chain, $k_B$ is the Boltzmann constant, T is the
temperature, and $\lambda_{ij}$ (the bond gap) expresses the amino
acid separation distance between the center of mass of residue
i and the center of mass of residue j when both are treated
as isolated molecules.

Here, the total CLE contribution to the free energy
($\Delta G_{cle}$) can be calculated by equation (2):

$$\Delta G_{cle} = \Delta G_{\xi} + \sum_{\text{all bonds } (ij)} \Delta G_{ij} + \sum_{i'j'} f_{i'j'}(\xi) \qquad (2)$$

wherein $\Delta G_{ij}$ is defined in equation (1), i' and j' are
indices specifying two secondary structure elements
(α-helices or β-strands) that are joined together by the
corresponding set of bonds i and j, $f_{i'j'}(\xi)$ is a positive
definite penalty function used to enforce topology
constraints on the minimum allowed sequence length of a loop
connecting two elements of secondary structure i' and j' and is
a function of the persistence length, and $\Delta G_{\xi}$ is a
renormalization correction and is an integral function of
ξ as shown by equation (3):
$$\Delta G_{\xi} = (\gamma + 1/2)\, N k_B T \int_{1}^{\xi} [\,\ldots\,]\, dx \qquad (3)$$



wherein γ, $k_B$, and T mean the same as defined for equation
(1), N indicates the number of amino acids in the said sequence,
D is the dimensionality of the system, the limits in the
integral (1 → ξ) indicate the change in the number of degrees
of freedom from an individual amino acid residue to a cluster
of ξ amino acids treated as a group (where ξ > 1 amino acid
and ξ need not be an integer), and x is a dummy variable in
the integral substituting for ξ.

In the invention of this application, the optimal β-sheet
alignments can be obtained by using thermodynamics.

Further, the CLE method is applied in conjunction with
other derived or constructed thermodynamic potentials that
approximate hydrophobic, electrostatic and polar interactions,
in a thermodynamic calculation to account for both short and
long range folding interactions and predict a minimum free
energy and corresponding topology of the said amino acid
sequence.

The invention of this application also relates to a
method for building a 3D structure of a protein for MD simulation
from the topology obtained by one of the above-mentioned methods.

The invention relates to a method to predict the
topology of the spatial arrangement of an amino acid sequence
using the entropy evaluation model, comprising the following
steps:

A. obtaining an amino acid sequence of a protein,

B. preparing information on the secondary structure of the
said amino acid sequence by way of at least one
theoretical or experimental estimate,

E. applying the CLE method to approximate the global folding
kinetics of the said amino acid sequence,

I. using the global folding kinetics to predict the optimal
topology of the said amino acid sequence, and

F. storing the information in a data file or in other form
of digital memory.



BRIEF DESCRIPTION OF THE DRAWING 



The file of this patent contains at least one drawing 
executed in colour. Copies of this patent with colour drawings 
will be provided by the Office upon request and payment of the 
necessary fee. 

The present invention may be more readily described with 
reference to the accompanying drawings, in which: 

Figure 1 shows IL-8 topology in the beta meander region.
The yellow and orange squares represent residues in the β-strand
regions whose rotors point out of the page and into the page
respectively. The green residues are coil regions and red refers
to α-helical regions.

Figure 2 shows LECT2 topology in the β-sheet region.
The yellow and orange squares represent residues in the β-strand
regions whose rotors point out of the page and into the page
respectively. The green residues are coil.

Figure 3 shows the structure of LECT2 obtained by MD
simulation using the constraints of the NMR secondary structure
in conjunction with the calculated topology data.

Figure 4 shows the 3D structure of LECT2 after refining
the NMR data further using information about the topology (after
300 ps simulation). The structure in Figure 3 is also the same
shape, but there is considerable variation in the positions of
the amino acids.

Figure 5 shows topology and mapping of an antiparallel
beta sheet structure. In the top right hand corner is a cartoon
of the topology of an antiparallel β-sheet protein. The
β-strands are assumed to be located at the amino acid positions
(1-5), (11-15), (19-23), and (26-30), where, for example, the
dash in (1-5) indicates residues between residue 1 and residue
5. The red region indicates the areas of the map that involve
the formation of the β-meander (strands 1, 2, joining 3). The
blue region marks off the areas of the map where a β-strand can
form to the right hand side of the cartoon, and the green region
indicates areas where the β-strand can form on the left hand
side of the topology cartoon. The global entropy strongly
influences this joining, and for such a short turn region
(residues 24 and 25), it can be inferred that the entropy will
select the right hand side. For a long coil region, the entropy
is likely to predict the left hand side, unless there are
strongly incompatible residues in these locations.

Figure 6 shows an example of a parallel beta sheet
structure. The cartoon at the top right hand corner is used in
the same way as in Figure 5. The sequence consists of beta strands
located at (1-5), (16-20) and (31-35). It can be seen that the
directions of the beta strand linkages are different from Figure
5. These are the primary distinguishing features between
parallel and antiparallel beta-sheets.

Figure 7 shows the topology and mapping of a protein that
consists of only α-helices. Diagrammatically, the arrangement
of the α-helix between local residues appears as the red bars on
the map. The topological arrangement of these α-helices with
respect to each other is indicated by the brown dotted lines. The
dotted line is used because only some of the residues in the
adjoining α-helix are actually making contact. In this example,
the topological arrangement looks similar to the antiparallel beta
sheets shown in Figure 5.

Figure 8 shows the role of entropy in the formation of different
kinds of secondary structure: (a) α-helix, (b) a pleated β-sheet,
and (c) a Greek key structure. In these processes, the rate
constants are as follows: $k_1 > k_2 > k_3$. For both (a) and (b), the
whole structure is formed at roughly the same time and effectively
collapses into such a structure. For (c), the folding will first
depend on $k_2$, and then on $k_3$ with $k_2 > k_3$.



DESCRIPTION OF THE PREFERRED EMBODIMENT

Definitions and background on biopolymer theory

We first present a brief summary of the concepts used
within this work. Special words used in other parts of this work
are given in double quotation marks (" ") when first encountered
in this section. It is hoped that by seeing these words used
in this context, the nearly interchangeable meaning of such
expressions as 'entropy loss' and 'global entropy' will become
clear. For further details and clarification on the theoretical
concepts of biopolymer folding, reference 48 (although somewhat
dated now) is very strongly recommended. Additional
supplementary material can be found in such sources as A.Yu.
Grosberg and A.R. Khokhlov, Statistical Physics of
Macromolecules, AIP Press, Woodbury (NY), 1994; M. Doi and
S.F. Edwards, The Theory of Polymer Dynamics, Clarendon Press,
Oxford, 1986; and P.G. de Gennes, Scaling Concepts in Polymer
Physics, Cornell University Press, Ithaca, 1979. An
understanding of the concept of "renormalization theory" as
used in this presentation can be gained from the first six
chapters of J.J. Binney, N.J. Dowrick, A.J. Fisher, and M.E.J.
Newman, The Theory of Critical Phenomena: an Introduction to
the Renormalization Group, Clarendon Press, Oxford, 1992.

The word "cross link" is used in a very broad sense in
this work to describe any type of bond (35). In proteins, typical
bonding effects include hydrophobic interactions, covalent
bonds such as the disulfide bonds, hydrogen bonds, and salt
bridges. Each bonding effect is considered a cross-link in this
model.

The entropy associated with the folding of a biopolymer
is a statistical function associated with the number of
conformations that are available to the biopolymer. For
example, if we restrict ourselves to the three regions of the
Ramachandran plot that correspond to the right-handed α-helix
region ($\alpha_R$), the β-sheet or extended region (β or e), or
the left-handed α-helix region ($\alpha_L$), then an N residue
peptide has $3^N$ conformations available to it (excluding
corrections for excluded volume). Even for such a grossly
over-simplified model for the conformations, the number of
possible conformations becomes astronomical for any reasonably
sized protein.
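
As a rough illustration of how quickly this count grows, the following sketch evaluates the three-state estimate for a few chain lengths. The function name and the chosen lengths are illustrative assumptions, not part of the method, and excluded-volume corrections are ignored as in the text.

```python
def three_state_conformations(n_residues: int) -> int:
    """Count backbone conformations when every residue independently
    occupies one of three Ramachandran regions (alpha_R, beta, alpha_L).

    Illustrative helper only; excluded-volume corrections are ignored.
    """
    if n_residues < 0:
        raise ValueError("n_residues must be non-negative")
    return 3 ** n_residues


if __name__ == "__main__":
    # Chain lengths chosen arbitrarily for illustration.
    for n in (10, 50, 100):
        count = three_state_conformations(n)
        print(f"N = {n:3d}: 3^N = {count:.3e}")
```

Even at N = 100, a modest size for a protein, the count is far beyond anything that could be enumerated, which is why the simplified polymer models that follow are used instead.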

Therefore, the problem is usually reduced to a much
simpler model, the simplest and most transparent being the
"random flight model", also known as the "Gaussian polymer chain"
(GPC) model. In the GPC-model, the individual monomers
(referred to here as "mers" for short) are reduced to mere
formless particles akin to "beads on a string". Although very
abstract and seemingly not highly representative of any real
monomers, such models are able to approximate some important
features of polymer dynamics.

The GPC-model expresses the statistical probability of
finding a given end-to-end separation distance r for the first
and last mer in the polymer sequence and is expressed as follows

$$p_G(r) = C_N^{1}\, r^{2} \exp(-\beta r^{2})\,\Delta r \qquad (0a)$$

where $\beta = 3/(2\xi N b^2)$, $C_N^{1} = 4\pi(\beta/\pi)^{3/2}$, N is the
number of monomers in the polymer chain, b is the separation
distance between the monomers, and ξ is the persistence length
expressed in units of monomer separation distance b. Clearly,
$p_G(r)/(4\pi r^2 \Delta r)$ is the "probability density function
(pdf)" of a Gaussian distribution. For ξ > 1 mer, the
persistence length indicates that the neighboring mer (or mers)
within a distance ξ along the polymer chain exhibit "strong
coupling" (meaning that the motion of these molecules is
"highly correlated") and their motions will not be sufficiently
independent to treat as distinct mers. For mers separated by a
distance greater than ξ there is only "weak coupling" between
the respective mers (low degree of correlation) and the
elements can be treated as approximately independent. The
structure of the GPC takes the form of $\mathcal{N}$ (= N/ξ)
"beads" separated by a distance $b_\xi$ (= ξb) where ξ mers are
grouped into a single bead. It is easy to see that the
root-mean-square separation distance between the first and last
mer in Eqn (0a) [...] because Eqn (0a) is a Gaussian function.

The GPC model is too simple a model for a real polymer
because it does not even consider that the polymer chain is
"self-avoiding": a property in which no two mers of the polymer
sequence can occupy the same spatial position at the same time
(conservation of matter). The first approximation of a
self-avoiding polymer chain consists of a variant of the "Gamma
function" that has the form

$$p_\Gamma(r) = C_N^{\gamma}\, r^{2\gamma} \exp(-\alpha r^{2})\,\Delta r \qquad (0b)$$

where $\alpha = 3/(2\xi N b^2)$,
$C_N^{\gamma} = 2\,(3/(2\xi N b^2))^{\gamma + 1/2} / \Gamma(\gamma + 1/2)$,
$\Gamma(\gamma + 1/2)$ is the Gamma function, and γ is a
dimensionless parameter for which γ = 1 renders Eqn (0b) equal
to Eqn (0a). For all known polymers, γ > 1. Because of the
nature of Eqn (0b), we call this a "Gamma polymer chain (Γ-PC)"
and the respective function a "Gamma-pdf (Γ-pdf)".
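
The two distributions can be compared numerically. The sketch below is a minimal illustration that assumes the reading α = β = 3/(2ξNb²) for both pdfs and returns the density per unit r (that is, without the Δr factor); the function names, default parameter values and the simple trapezoidal integrator are assumptions introduced here for illustration only.

```python
import math


def gaussian_pdf(r, n_mers, xi=1.0, b=1.0):
    """Gaussian polymer chain, Eqn (0a), per unit r:
    4*pi*(beta/pi)^(3/2) * r^2 * exp(-beta * r^2)."""
    beta = 3.0 / (2.0 * xi * n_mers * b * b)
    return 4.0 * math.pi * (beta / math.pi) ** 1.5 * r * r * math.exp(-beta * r * r)


def gamma_pdf(r, n_mers, gamma=1.0, xi=1.0, b=1.0):
    """Gamma polymer chain, Eqn (0b), per unit r:
    C * r^(2*gamma) * exp(-alpha * r^2), with C chosen so the pdf integrates to 1."""
    alpha = 3.0 / (2.0 * xi * n_mers * b * b)
    norm = 2.0 * alpha ** (gamma + 0.5) / math.gamma(gamma + 0.5)
    return norm * r ** (2.0 * gamma) * math.exp(-alpha * r * r)


def integrate(f, lo, hi, steps=20000):
    """Plain trapezoidal rule; adequate for these smooth, decaying functions."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h


if __name__ == "__main__":
    n = 50  # hypothetical chain length
    # With gamma = 1 the Gamma-pdf should coincide with the Gaussian chain.
    print(gaussian_pdf(3.0, n), gamma_pdf(3.0, n, gamma=1.0))
    # Both densities should integrate to approximately 1 over r >= 0.
    print(integrate(lambda r: gamma_pdf(r, n, gamma=1.5), 0.0, 60.0))
```

Setting γ = 1 reproduces the Gaussian chain because 2/Γ(3/2) = 4/√π, which matches the prefactor 4π(β/π)^{3/2} of Eqn (0a).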

For a system in which the size of the beads is the same as the
mers, these pdfs permit us to express the entropy as follows

$$\Delta S = S - S_0 = k_B \ln(p(r)) = k_B\{\ln(C_N^{\gamma}) + 2\gamma \ln(r/b) - \alpha r^2\} \qquad (0c)$$

where $S_0$ is a reference entropy and constant,
$\alpha = 3/(2\xi N b^2)$, and we often refer to Eqn (0c) in this
method as the "cross linking entropy (CLE)". The maximum in Eqn
(0c) can be found by evaluating the force (f) as
$f(r) = -T(d\Delta S/dr)$ and solving for the stationary point.
Using this force equation, one finds a maximum at $r_0$, with
the correct $r_0$ for Eqn (0a) (shown above) and
$r_0^2 = 2\gamma\xi N b^2/3$ for the Γ-pdf. This shows that the
maximum in the entropy of this model occurs at $r_0$, and for
$r < r_0$ and $r > r_0$ the entropy decreases because fewer
conformations are possible for the polymer chain. This decrease
in entropy, due to restriction of conformations, is often
called "entropy loss". This entropy (ΔS) is global because even
when thousands or millions of mers separate the ends of the
polymer chain, the correlation still increases where
$r_0^2 \propto N$.
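
The stationary point quoted above can be checked directly from Eqn (0c). The short derivation below is a worked sketch that assumes α = 3/(2ξNb²), as defined in the text, and treats the normalization term ln(C) as independent of r.

```latex
\begin{align*}
f(r) &= -T\,\frac{d\,\Delta S}{dr}
      = -k_B T\left(\frac{2\gamma}{r} - 2\alpha r\right), \\
f(r_0) = 0 \;&\Rightarrow\; r_0^{2} = \frac{\gamma}{\alpha}
      = \frac{2\gamma\,\xi N b^{2}}{3}, \\
\left.\frac{d^{2}\Delta S}{dr^{2}}\right|_{r_0}
     &= -k_B\left(\frac{2\gamma}{r_0^{2}} + 2\alpha\right)
      = -4\alpha k_B < 0 .
\end{align*}
```

The negative second derivative confirms that the stationary point is a maximum of the entropy, and the result makes the scaling of $r_0^2$ with N explicit.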

On formation of a cross-link between the ends of the chain,
r is compressed to some distance λb, where λ is a dimensionless
proportionality constant we refer to as the "bond gap".

The end-to-end correlation effects are not restricted to
the terminal ends of the sequence. For a given monomer i and
j (where i ≠ j and i < j), this weak coupling leads to a
dependence in which $r_{ij}^2 \propto (j - i + 1)\,b^2$. In the
current embodiment, we use this to help estimate the entropy
loss caused by the formation of bonds (or cross-links) between
i and j. Furthermore, because we assume weak coupling between
mers in the polymer chain separated by a distance ξ or greater,
the coupling between a cross-link (bond) formed between mers
$i_1$-$j_1$ and that of $i_2$-$j_2$ is sufficiently weak that
they can be treated as independent, permitting us to
approximate the "global entropic contribution" by a summation.
The summation of all the global entropic contributions is what
we call the "total (global) entropic contribution". For mers
separated by a distance less than ξ, we apply "renormalization
group theory" in which these mers are grouped as though they
behaved as a single monomer of fractional size 1/ξ.
Approximations and corrections that account for this grouping
are developed from this theory. Again, if the beads and mers
exactly correspond in number, the global entropic contribution
for mers i and j is similar to Eqn (0c)

$$\Delta S_{ij} = S_{ij} - S_{ij,0} = k_B \ln(p(r_{ij})) = k_B\{\ln(C_N^{\gamma}) + 2\gamma \ln(r_{ij}/b) - \alpha r_{ij}^2\} \qquad (0d)$$

where $C_N^{\gamma}$ and α have essentially the same meaning as
in Eqn (0c), likewise $S_{ij,0}$ and $S_0$, except that in Eqn
(0d) N = j - i + 1 (where j > i is assumed). When the more
likely situation is found where mers i and j do not correspond
exactly to the bead size (ξ > 1), Eqn (0d) is weighted by a
factor 1/ξ to account for this "renormalized contribution" of
mers i and j to the global entropy. Weighting Eqn (0d) by 1/ξ
amounts to taking the "average contribution" of the group of
mers to the global entropy of the beads.

The "free energy (FE) contribution" from this "global CLE
contribution" for the interaction of mers i and j is therefore
$\Delta G_{ij} = -T\Delta S_{ij}$, and this FE, when viewed from
the perspective of folding a biopolymer from its entropy
maximum ($r_0$) to some small distance (r = λb), can be seen to
yield a positive value. Since the free energy change must be
negative for a process to be spontaneous, a positive entropic
contribution to the FE can be understood to be a "free energy
cost", since some other interaction must make up the difference
in the free energy to ensure that the reaction remains
spontaneous.
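
The free energy cost described here can be sketched numerically. The example below is a minimal illustration, not the implementation of the embodiment: it evaluates the entropy difference of Eqn (0d) between the entropy maximum $r_0$ and the cross-linked distance λb, weights it by 1/ξ, and converts it with ΔG = -TΔS. The function names, the default values for γ, ξ, λ and T, and the use of a per-mole Boltzmann constant are all assumptions chosen for illustration.

```python
import math

# Boltzmann constant per mole (gas constant) in kcal/(mol K); unit choice is an assumption.
K_B = 0.0019872


def cle_free_energy(i, j, gamma=1.75, xi=3.0, bond_gap=1.0, temperature=310.0):
    """Free-energy cost (kcal/mol) of cross-linking residues i and j (j > i).

    Uses the Gamma-pdf entropy with alpha = 3 / (2 * xi * N * b^2) and
    N = j - i + 1, taking the difference between r = bond_gap * b and the
    entropy maximum r0, weighted by 1/xi. Default parameters are hypothetical.
    """
    if j <= i:
        raise ValueError("requires j > i")
    n_sep = j - i + 1
    # x = (r0 / (bond_gap * b))^2, with r0^2 = 2 * gamma * xi * N * b^2 / 3
    x = 2.0 * gamma * xi * n_sep / (3.0 * bond_gap ** 2)
    return (gamma * K_B * temperature / xi) * (math.log(x) - (1.0 - 1.0 / x))


def total_cle_free_energy(bonds, **params):
    """Sum the independent (weakly coupled) contributions over all cross-links."""
    return sum(cle_free_energy(i, j, **params) for i, j in bonds)


if __name__ == "__main__":
    # Hypothetical cross-links given as (i, j) residue index pairs.
    bonds = [(1, 20), (5, 30)]
    for i, j in bonds:
        print(f"bond {i}-{j}: {cle_free_energy(i, j):.3f} kcal/mol")
    print(f"total: {total_cle_free_energy(bonds):.3f} kcal/mol")
```

Because ln(x) - (1 - 1/x) is non-negative for all x > 0, the cost is never negative, and it grows with the sequence separation between the two residues, in line with the "free energy cost" interpretation given above.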

In general, a "global entropy evaluation" of the effects
of correlation between mer i and mer j is a function of the
allowed conformations of the polymer chain due to its chemical
and mechanical properties. It is not a fully settled issue as
to what extent complex systems such as a heterogeneous
biopolymer can be simplified to a mere summation of the
respective global entropic contributions corrected over a
persistence length ξ. More forms of correlation might be
anticipated than are currently addressed with the Γ-pdf.
Therefore, in its most general form, the global entropic
contribution or CLE between residue i and j of a protein is
currently an unknown function. In the current embodiment, this
function has been approximated by the Γ-pdf subject to
corrections for ξ > 1 mer. Other local interactions such as
hydrophobic, charged and polar conditions are also assumed to
be sufficiently weakly coupled so that they can be resolved at
the monomer or at most the dimer level.

In summary, the general form of the CLE should not be
assumed to be a Γ-pdf. However, for practical reasons of
computation, the Γ-pdf is used to evaluate the CLE in the
embodiment and examples given here. As long as one can find some
way to write some description of the CLE in terms of either the
beads or the mers, the method for evaluating the CLE, the
assumptions, the approximations, and the procedures used in
evaluating the free energy of the biopolymer ("the CLE method")
will amount to the same procedure.

Theoretical secondary structure predictions 

Prior to use of this invention to determine the topology
of the protein, information on the secondary structure is
required.

Determination of the secondary structure can come from
a variety of sources. Theoretical estimates can be obtained from
such sources as Predict Protein (33-37), Jpred (38,39), the
secondary structure predictions from 3Dpssm (40,41), NNpredict
(42,43), or PSIpredict (18-20). These theoretical estimates
can be further supplemented with sequence alignment obtained
from BLAST (and PSI-BLAST) to find regions in which conserved
segments, such as the hydrophobic core in particular, remain
essentially unchanged by differences in the aligned sequences.
Secondary structure information can also come from an NMR
experiment or x-ray crystallography.

The protein folding model: cross-linking entropy (CLE) 

For questions on algorithms discussed in this section,
the reader is referred to the following reference books: G.F.
Luger, Artificial Intelligence: Structures and Strategies for
Complex Problem Solving, 4th ed., Pearson Education Ltd, Essex,
2002, and P. A. Pevzner, Computational Molecular Biology: An
Algorithmic Approach, MIT Press, Cambridge (MA), 2000.

Levinthal's Paradox points out that if a protein must
search all the possible conformations available to it, even a
simple protein would take much longer than the lifetime of the
universe to fold (44). To eliminate this large conformation
space, nature must reduce the number of degrees of freedom in
the folding process and the free energy should be such that
folding has a funnel shape (44-47). In the model, these degrees
of freedom are expressed in the persistence length (ξ) (48).
A very stiff structure would have a long persistence length,
would tend to remain correlated over several amino acids and
would fold as a unit rather than as individual residues
(30,47,48). In the model, it is assumed that there are no
significant kinetic traps (5,9) and that the denatured state
of a protein (4,49) can be approximated by a random flight
approximation (30). We have shown that the CLE successfully
models the global folding of RNA in conjunction with the
thermodynamic parameters that are currently in use for these
structure calculations (30). Whereas the detailed packing
rules of protein folding are fundamentally different from RNA
(50), the entropy associated with the global folding
conformations of a biopolymer can be treated as independent of
the details of the particular system in the first approximation
(48).

The CLE method helps to simplify the problem of
calculating the topology of the protein. There are several ways
in which this is accomplished.

First, the amino acids are grouped in units of length ξ,
where ξ could be treated as a variable if more information is
known about the flexibility of the biopolymer. Units of length
ξ are of the same order as the length of the secondary structural
units. Hence, the problem of handling N weakly coupled mers is
reduced to one of handling approximately $2N_{ss}$ weakly coupled
beads, where $N_{ss}$ is the number of secondary structure elements.
This drastically reduces the problem from solving a $3^{N}$
conformation problem to a much simpler one of beads on a string.
Second, in applying the CLE, it is assumed that the three
state secondary structure (i.e., α-helix, β-strand, coil) is
precisely known and we have only to consider how it is to be
arranged in the folding process. At least 70% of the secondary
structure can be predicted based on primary sequence data alone,
and this can be increased up to 80% by using sequence alignment
techniques (33).

However, if no topological restrictions are applied in
the folding problem, the number of ways in which $N_{ss}$ secondary
structure elements can combine is factorial ($2^{N_{ss}-1} N_{ss}!/2$). Hence
for 8 secondary structure elements, there are more than 2.5
million ways these structures could combine. To reduce the
number of untenable combinations of β-strands to a manageable
number, we use the CLE method to help deduce the global folding
of the β-strand regions.
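
As a quick arithmetic check of the figure quoted above, the following minimal Python sketch evaluates the unrestricted count, assuming it has the form 2^(N_ss - 1) * N_ss! / 2 (this form reproduces the "more than 2.5 million" quoted for 8 elements). The function name `count_arrangements` is illustrative only and is not part of the invention.

```python
from math import factorial


def count_arrangements(n_ss: int) -> int:
    """Number of unrestricted arrangements, 2**(n_ss - 1) * n_ss! / 2."""
    if n_ss < 2:
        raise ValueError("need at least two secondary structure elements")
    # 2**(n_ss - 1) * n_ss! is even for n_ss >= 2, so integer division is exact.
    return 2 ** (n_ss - 1) * factorial(n_ss) // 2


if __name__ == "__main__":
    for n_ss in (4, 6, 8):
        print(n_ss, count_arrangements(n_ss))
```

For 8 elements the sketch gives 2,580,480 arrangements, which illustrates why an exhaustive enumeration quickly becomes impractical.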

Therefore, a third aspect of this method is the way in
which folding is carried out. In the RNA problem, we have found
that a branch and bound algorithm is sufficient for describing
the folding of RNA. The reason is that RNA folds from the 5'
to the 3' end and therefore, it is actually folding during
synthesis or as it is transported through a biological membrane.
A similar situation occurs in a protein where the synthesis and
transport proceed from the N-terminus to the C-terminus.
Therefore, in the limiting case, the branch and bound approach
is sufficient to solve this protein-folding problem within the
CLE model. However, the branch and bound method still exhibits
exponential growth in computer resources. The theory developed
from the CLE shows that such structures as pleated β-sheets
are strongly favored entropically compared to other conceivable
structures such as a jellyroll fold. Protein folds such as the
Greek-key or the jellyroll fold are rare compared to β-meanders.
The reason is entropy. (A similar situation can be posited
in the folding of adjacent α-helices in such proteins as
cytochrome C.) A picture emerges in which the folding of a
protein is quite rapid, and the dynamics are such that a protein
effectively "grabs" the closest neighboring secondary structure
element that exhibits a reasonably favorable FE. Since the
closest neighbor is likely to be the next adjacent strand, it
should be no wonder that the β-meander structure is so common
in protein folds (refs 1-3, and although quite dated, we also
recommend G.E. Schulz and R.H. Schirmer, Principles of Protein
Structure, Springer-Verlag, New York, 1979). Evolution has further
fine-tuned this folding so as to usually produce a
thermodynamically stable structure that can also be obtained
through refolding (where N-terminus to C-terminus folding
should not be assumed but rather nucleation). The length of many
proteins is often in the range of 200 amino acids or less.
Secondary structure elements are typically on the order of 5
amino acids in length (similar to ξ), and most often, only half
of the protein sequence forms secondary structure. This means
that for a 200 amino acid protein, there are about 20 secondary
structure segments. The jellyroll fold, one of the most complex
in protein folding groups, is composed of only six β-strands.
This means that the scan region for folding can be limited to
about 8 elements in the first step. A recursive search (without
replacement) can then be employed that preserves the dominant
order in which the secondary structure units combine. Unlike
the time it takes for the protein to fold, which is an exponential
function of the global entropy (see example 5, Eqns (4) through
(7)), the recursive search (without replacement) is done in
polynomial time. If the secondary structures are all of the same
size, then the time to align individual mers can be treated as
a constant ($\Delta t$) times the recursive search (without
replacement) of the secondary structure elements, or

$$t_{search} \approx \Delta t \, B \, (N_{ss}+1)(N_{ss}-1) \sim N_{ss}^{2},$$

where B is a constant and $\max\{\Delta t\} \propto \xi^{3}$
(depending on the procedure and assumptions used to evaluate
$\Delta t$). This is a substantial gain over other methods. The
underlying assumption in the CLE model is clearly that proteins
don't wait a long time to find their fold. In the event that
much longer folding times are thought to occur, a
branch-and-bound approach is also an option.
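
The segment estimate and the limited scan region described above can be illustrated with a short Python sketch. The sliding-window scheme in `scan_windows` is an assumption made here for illustration (the text only states that the scan region can be limited to about 8 elements); both function names are hypothetical.

```python
def estimate_segments(n_residues: int = 200,
                      structured_fraction: float = 0.5,
                      element_length: int = 5) -> int:
    """Approximate number of secondary structure segments in a protein."""
    return round(n_residues * structured_fraction / element_length)


def scan_windows(n_segments: int, window: int = 8):
    """Yield (first, last) segment indices of each scan region.

    Assumed scheme: a window of `window` elements slides from the
    N-terminus toward the C-terminus one element at a time.
    """
    for first in range(0, max(n_segments - window, 0) + 1):
        yield first, min(first + window, n_segments) - 1


if __name__ == "__main__":
    n_segments = estimate_segments()
    print("segments:", n_segments)
    print("scan regions:", list(scan_windows(n_segments)))
```

With the default values quoted in the text (200 residues, half structured, 5 residues per element), the estimate is 20 segments, and each scan region covers 8 of them.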

We now discuss how the cross-linking entropic
contribution to the free energy (FE) is evaluated as it pertains
to using the Gamma polymer chain equation with no loss of
generality for other methods that can express the entropy
relationship between two residues i and j in an additive
fashion. The total FE refers to the summation of the individual
contributions of bonds (cross-links) i and j. In the current
embodiment, ξ is treated as though it were a constant. However,
the expressions can be modified to consider a variable ξ if
an experimental or theoretical estimate is available.
The global entropic free energy cost for folding a
denatured protein into a β-sheet structure is expressed as
follows. Let i and j represent the indices of two distinct
residues in a protein sequence, where j > i. The number of
residues separating i and j is $\Delta N_{ij} = j - i + 1$. The global FE
contribution from the CLE of residues i and j can be
approximated by the following expression

(1)

where $\Delta G_{ij}$ is the difference in the free energy contribution to
the CLE from residues i and j transitioning from the denatured
(random flight) state to the native state, $\Delta S_{ij}$ is the
corresponding entropy loss, ξ is the persistence length, γ is
a weight parameter describing the self-avoiding properties of
a polymer chain (γ = 1.75 in three dimensions; Ref. 51), $k_B$ is
the Boltzmann constant, T is the temperature, and $\lambda_{ij}$ (the bond
gap) expresses the amino acid separation distance between
residues i and j in the native state. Typical values for ξ are
on the order of 3 amino acids, but ξ is highly sequence and
structure dependent and can be much longer.

The total CLE contribution becomes the sum of each $\Delta G_{ij}$
contribution to the native state,

$$\Delta G_{cle} = \Delta G^{\circ}_{\xi} + \sum_{\text{all bonds}\,(i,j)} \Delta G_{ij} + \sum_{i',j'} f_{i'j'}(\xi) \qquad (2)$$

where $\Delta G_{ij}$ is defined in Eqn (1), i' and j' are indices
specifying two secondary structure elements (α-helices or
β-strands) that are joined together by the corresponding set
of bonds i and j, $f_{i'j'}(\xi)$ is a positive definite penalty function
used to enforce topology constraints on the minimum allowed
sequence length of a loop connecting two elements of secondary
structure i' and j' and is a function of the persistence length ξ,
and $\Delta G^{\circ}_{\xi}$ is a renormalization group correction and is an integral
function of ξ,

(3)

where N indicates the number of amino acids in the said sequence,
D is the dimensionality of the system, the limits in the integral
(1→ξ) indicate the change in the number of degrees of freedom
from an individual amino acid residue to a cluster of ξ amino
acids treated as a group (where ξ > 1 amino acid and ξ need not
be an integer). The value of $\Delta G^{\circ}_{\xi}$ can also be found by fitting
with the aid of experimental reference data. The contribution
$\Delta G^{\circ}_{\xi}$ accounts for the fact that we have grouped individual
residues together and therefore reduced the number of degrees
of freedom available to the peptide sequence (30).

The global CLE is a function of the likelihood that a
polymer would spontaneously adopt a specified configuration
represented by Eqn (2). Each cross-link adds to the global
configuration entropy according to Eqn (2). This results in a
cumulative (integral) effect that grows as $\Delta N_{ij} \ln(\Delta N_{ij})$.
Ultimately this limits a domain or loop size because the CLE
grows logarithmically {$\ln(\Delta N_{ij})$} with each cross-link whereas the
individual hydrophobic interactions are local and independent
of $\Delta N_{ij}$. The CLE has shown considerable progress in addressing
the folding of RNA (30).
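
The cumulative growth described above can be illustrated numerically. The sketch below is only a stand-in: it assumes a purely logarithmic cost of (γ/ξ) ln(ΔN) per cross-link, in units of k_B T, and is not the full expression of Eqn (1). The hairpin geometry (a turn of 4 residues, each further cross-link spanning 2 more residues) and the function names are likewise hypothetical.

```python
import math

GAMMA = 1.75  # self-avoiding weight quoted in the text for three dimensions
XI = 3.0      # persistence length in residues ("on the order of 3" in the text)


def log_cost(delta_n: float) -> float:
    """Stand-in entropic cost of one cross-link, in units of k_B*T."""
    return (GAMMA / XI) * math.log(delta_n)


def hairpin_cost(n_links: int, turn: int = 4) -> float:
    """Sum the stand-in cost over the nested cross-links of a hairpin.

    The k-th cross-link (k = 0, 1, ...) closes a span of turn + 2*k residues.
    """
    return sum(log_cost(turn + 2 * k) for k in range(n_links))


if __name__ == "__main__":
    for n_links in (5, 10, 20, 40):
        total = hairpin_cost(n_links)
        # total cost, and mean cost per cross-link
        print(n_links, round(total, 2), round(total / n_links, 3))
```

Under these assumptions the mean cost per cross-link keeps rising as the hairpin lengthens, whereas a local interaction contributes a fixed amount per contact; this is the sense in which the CLE limits domain or loop size.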

In our current global CLE evaluation strategy, we assume
the hydrophobic effect is the dominant feature that leads to
attraction between different secondary structure elements of
the protein sequence (50,52). Thermodynamic parameters are
dealt with based upon thermodynamic potentials in the local
regime (52,56-59). Structure is considered from standard models
(3,60-66). To help to orient the secondary structure, we use
the polar, non-polar and hydrophobic interactions between
neighboring β-sheets to help align the secondary structure (3,56).
Hydrophobic parameters can be obtained from such sources as Ref.
45. Other parameterizations can be either resolved based on
approximations from molecular dynamic simulations, or various
theoretical or experimental sources.

To help to orient the β-strands, we use the polar,
non-polar and hydrophobic interactions between neighboring
β-sheets to help align the β-strands since these are likely to
help stabilize the structure (3,56). Scales of hydrophobicity
can be obtained from such experimental sources as Y. Nozaki and
C. Tanford, J. Biol. Chem. 246:2211-2217 (1971) and references
therein, to name one. Theoretical sources could be, for example,
T. Lazaridis, J. Phys. Chem. B, 102:3531-41 (1998). Other
parameterizations can be either resolved based on
approximations from molecular dynamic simulations, or
constructed from some first principle approach.

This latter approach follows a similar strategy first used
by Cohen et al. (27,28) to find ways to arrange β-sheets
together, termed β-sheet alignment. As Cohen et al. reasoned,
the alignment of residues should follow sensible relationships
for neighboring residues along a pleated β-sheet region such
as similar hydrophobicity, complementary acid/base
interactions, etc. (56-59). The improvement here is that we use
this entropy to reduce the multitude of combinations to a
single topology or a finite set of solutions if dominant
suboptimal structures are considered. Moreover, in this
invention, the optimal β-sheet alignments are obtained by using
derived (or estimated) thermodynamic potentials rather than
statistical relationships between different proteins. A
statistical relationship approach would be a possible
alternative in a tuned potential; however, mere statistical
relationships provide limited insight on the origin of the
effects or what contextual aspects influence the uncertainty
whereas physical models provide a conceptual framework that can
be improved upon with better understanding.

A similar strategy is applicable in the case of α-helices.
The major difference is that the residue alignments between two
secondary structures must coincide with the cylinder shape of
an α-helix. Thus the contact points of residues along this face
have gaps {i, i+1, i+4, i+5, i+8, i+9} where i is a reference
position of the beginning of an α-helix (27,28).
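
The contact pattern along one helix face can be generated with a few lines of Python. This is a minimal sketch: the function name `helix_face_positions` is hypothetical, and continuing the pattern beyond i+9 (pairs of residues repeating every four positions) is an assumption extrapolated from the six positions listed above.

```python
def helix_face_positions(start: int, helix_length: int) -> list:
    """Return residue indices lying on one contact face of an alpha-helix.

    Follows the pattern i, i+1, i+4, i+5, i+8, i+9, ...: two consecutive
    residues, repeated every four positions, truncated at the helix end.
    """
    end = start + helix_length  # first index past the helix
    positions = []
    for base in range(start, end, 4):
        for offset in (0, 1):
            if base + offset < end:
                positions.append(base + offset)
    return positions


if __name__ == "__main__":
    print(helix_face_positions(0, 10))
    print(helix_face_positions(10, 10))
```

For a 10-residue helix starting at position i, this reproduces the six contact positions given in the text.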

The protein folder works on the principle of crawling 
along the sequence from the N- to C- terminus searching for the 
nearest acceptable β-strand that permits the minimum entropy 
loss, balanced with favorable hydrophobic and ionic 
interactions. In this respect, the protein folder assumes that 
a protein folds as it is extruded from such biological 
structures as the ribosome rather than assuming that the protein 
specifically folds from the denatured state. 
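
The crawling principle can be sketched as a greedy search. The Python below is an illustration under stated assumptions, not the implementation of the invention: the entropic cost is a logarithmic stand-in, (γ/ξ) ln(ΔN) in units of k_B T, rather than Eqn (1); `LOCAL_FE` and `MIN_LOOP` are hypothetical values; and each strand is allowed at most two neighbours, as in a single sheet. The strand positions are those of the four-strand illustration used in Example 5.

```python
import math

GAMMA = 1.75     # weight parameter quoted in the text (three dimensions)
XI = 3.0         # persistence length in residues ("on the order of 3")
MIN_LOOP = 2     # hypothetical minimum loop length, in residues
LOCAL_FE = -3.0  # hypothetical favorable local FE per strand pair (k_B*T)


def pair_cost(strand_a, strand_b):
    """Stand-in FE for pairing two strands given as (first, last) residues."""
    loop = strand_b[0] - strand_a[1] - 1      # residues between the strands
    if loop < MIN_LOOP:
        return math.inf                       # topological penalty: loop too short
    delta_n = strand_b[1] - strand_a[0] + 1   # span closed by the pairing
    return (GAMMA / XI) * math.log(delta_n) + LOCAL_FE


def fold_n_to_c(strands):
    """Crawl from N- to C-terminus, pairing each new strand with the
    cheapest earlier strand that still has a free edge."""
    neighbours = [0] * len(strands)
    pairs = []
    for k in range(1, len(strands)):
        candidates = [(pair_cost(strands[m], strands[k]), m)
                      for m in range(k) if neighbours[m] < 2]
        cost, best = min(candidates)
        if cost < 0:                          # accept only a net favorable pairing
            pairs.append((best, k))
            neighbours[best] += 1
            neighbours[k] += 1
    return pairs


if __name__ == "__main__":
    strands = [(1, 5), (11, 15), (19, 23), (26, 30)]
    print(fold_n_to_c(strands))
```

Because the stand-in cost grows with sequence separation, each strand pairs with its immediate predecessor and the result is a β-meander, (0, 1), (1, 2), (2, 3) in zero-based strand indices.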

The algorithm currently is primarily concerned with 
fitting all-β-sheet proteins of approximately 70 amino acids 
in length. Longer sequences can in principle be solved using 
this approach with varying degrees of success in its current 
form. 

EXAMPLES

Practical examples are described below.

Example 1

When the invention is applied to a standard β-meander
protein such as IL-8 (melanoma growth stimulating activity:
1MGS), the correct topology is obtained. Table 1 shows the best
residue alignment and topology found in the calculation (67).

Table 1. A calculation result of the topology of IL-8 (a known
structure) using the invention. The '_' indicates that no residue
neighbors the region in a β-strand orientation.



In the top section of Table 1, the leftmost column
indicates the index of the given β-strand segment of the amino
acid sequence. The next two columns indicate the position at
the N-terminus (N) and the C-terminus (C) of the β-strand. The
next three columns indicate the determined topology of the
protein sequence. The middle column labeled f_i (forward strand
i) contains all the β-sheet indices. The column immediately
to the left (r_j: reverse strand j) indicates the β-strand
immediately adjacent to f_i whose index is less than i. Likewise,
the column immediately to the right of f_i (f_j: forward strand
j) indicates the β-strand immediately adjacent to strand i whose
index is greater than i. The final column indicates whether the
two β-strands that form the loop form antiparallel or parallel
β-sheets (b- means antiparallel beta).

For a β-meander, the middle row (1 2 3) indicates that
strand #2 has strand #1 to one side and strand #3 on the other.
(In the topology, the axis of symmetry is two-fold degenerate.)

The bottom part of the table indicates the detailed
topological arrangement of the mers in the β-strands relative
to each other. The amino acids associated with strand i are
denoted by $i_k$ where k specifies an individual amino acid in
secondary structure i and similarly for $j_k$. The expression $i_k$-$j_l$
indicates that the residues ($i_k$ and $j_l$) from each strand (i and
j) are directly adjacent to each other.

The topology and physical alignment of the β-strand
residues are shown graphically in Figure 1 as a cartoon. The
yellow and orange squares indicate rotors pointing out of the
page and into the page. The predicted alignment and topology
correspond exactly with those of the known structure 1MGS. It
can now be inferred that since our protein folder works from
the N-terminus to the C-terminus, the thermodynamic folding of
this structure also occurs in the order strand #1-strand #2,
followed by binding strand #2-strand #3. In a refolding
experiment, both structures should form nearly simultaneously.
If a branch-and-bound algorithm is used, then a refolding
experiment can be mimicked. It has been found that RNA folds
much faster when it is allowed to fold in the biologically
relevant way, suggesting that these rules are applicable even
to RNA biopolymers (S.L. Heilman-Miller and S.A. Woodson,
Effect of transcription on folding of the Tetrahymena ribozyme,
RNA 2003 9: 722-733).

Example 2

Similar to example 1, the structure of an undetermined
protein LECT2 is also calculated using this model (68,69).
Again, the topology is indicated in Table 2 in a similar way
and a cartoon of the alignment and topology is shown in Fig 2
(as an example).

strand 3 + 4: 49-56, 48-57, 47-58, 46-59, 45-60, 44-61, 43-_, 42-_

strand 4 + 5: 56-_, 57-_, 58-68, 49-67, 60-66, 61-65

strand 5 + 6: 68-73, 67-74, 66-75, 65-_

Table 2. A calculation result of the topology of LECT2 (an
unknown structure) using the invention. The '_' indicates that
no residue neighbors the region in a β-strand orientation.

For LECT2, there are 6 β-strands and the alignment also
has the form of a pleated β-strand structure. The pleated
β-sheet fold and the β-meander fold are some of the most common
structural motifs in the protein β-strand architecture (1-3).
Examples 3 and 4 show how this topology can be used to find the
3D structure of the LECT2 protein. The topology of LECT2 is shown
graphically in Figure 2.

Example 3

Combining NMR chemical shift data with topology information

NMR chemical shift data can now be used to evaluate the
topological prediction by utilizing torsion angle estimates
from such software as Talos (72,73) and weighting the
predictions relative to the certainty of the Ramachandran
angles using a power law such as $x^{n} 10^{n-2}$, where x is the ratio
of matching predictions to the best 10 predictions in the Talos
database and n > 6. An example of a prediction using this
combined method is shown in Fig 3.

Example 4

Building a trial 3D structure for MD simulation

Once a topology is obtained from this invention, the 3D
structure can be built and refined through dynamical
simulations using a number of approaches, for example Refs 29
and 47.

To fold the amino acid sequence into the correct secondary
structure and topology, the structure is first built in extended
form to mimic the approximate configuration of the $P_{II}$
structures found in denatured proteins (31,70,71).

To introduce the secondary structure, torsion
constraints are applied over the regions of secondary structure,
forcing out the secondary structure in the sequence using
relatively strong torsion constraints on the order of $k_\theta$ = 10
kcal/mol·rad² with a maximum potential energy (PE) of 100
kcal/mol, where $k_\theta$ is the torsional force constant. To make the
proposed contacts, weak distance constraints are then applied:
$k_r \lesssim 2.0$ kcal/mol·Å² (max. PE < 10 kcal/mol), where $k_r$ is an
effective spring constant. These distance constraints are
typically applied to the C=O···H bonds. The structure is then
allowed to relax during an MD simulation (either in explicit
water or in vacuo) using simulated annealing starting from a
high temperature (at least 400 K for at least 10 ps). The
simulation time is doubled with each decrease in the temperature
in a fashion akin to Newton's law of cooling. High temperature
runs (T > 600 K) required $k_r$ > 1 kcal/mol·Å², but annealing at
lower temperatures can be done with $k_r$ < 1 kcal/mol·Å². To help
minimize the effect of large fluctuations in the structure at
high temperatures, the time increment should be set to 0.125
fs and strong torsion constraints should be placed on all the
amide bonds to restrict the orientation to a trans configuration.
For temperatures below 410 K, the time increment can be set to
0.5 fs. This annealing process should be applied over a period
of 0.5 to 1 ns. The approach differs from Refs. 29 and 45 in
that we recommend using small distance constraints ($k_r$ < 2.0
kcal/mol·Å² versus 100 kcal/mol·Å²) and applying them over long
and graduated simulated annealing times (preferably in excess
of 200 ps) so that the structures have considerable time to
explore the local conformation space. We also recommend
starting with the $P_{II}$ conformation or an extended structure.
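
The graduated annealing schedule can be laid out programmatically before it is handed to an MD package. The Python sketch below only tabulates stages; it does not call any simulation software. The temperature ladder is a hypothetical example, while the doubling of the stage duration, the 10 ps starting duration and the time increments (0.125 fs at high temperature, 0.5 fs below 410 K) follow the description above.

```python
def annealing_schedule(temperatures_k, first_stage_ps=10.0):
    """Build a list of (temperature_K, duration_ps, timestep_fs) stages.

    The duration doubles with each decrease in temperature; the time
    increment is 0.125 fs at 410 K and above, and 0.5 fs below 410 K.
    """
    if list(temperatures_k) != sorted(temperatures_k, reverse=True):
        raise ValueError("temperatures must be in decreasing order")
    stages = []
    duration = first_stage_ps
    for temp in temperatures_k:
        timestep = 0.5 if temp < 410 else 0.125
        stages.append((temp, duration, timestep))
        duration *= 2
    return stages


if __name__ == "__main__":
    ladder = [700, 600, 500, 400, 350, 300]  # hypothetical temperature ladder (K)
    schedule = annealing_schedule(ladder)
    for stage in schedule:
        print(stage)
    print("total ps:", sum(duration for _, duration, _ in schedule))
```

With this six-stage ladder the total annealing time is 630 ps, which falls within the 0.5 to 1 ns period recommended above.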

The very high temperature permits sufficient thermal
energy to help fix some of the mis-oriented residues in the turns.
At 300 K, the distance constraints are reduced to 0.5 kcal/mol·Å²
(max 1.0 kcal/mol) and the simulation is run for 100 ps to help
the structure further relax. After annealing in the water bath,
all topological constraints can be removed and the structure
is allowed to relax at 300 K for 100 ps.

Figure 4 shows an example of the 3D structure of the LECT2 
protein built in this way. 

Example 5

Folding principles of various structures

To help illustrate the folding model, in Fig 5, a β-meander
cartoon structure forms the central region labeled 1, 2 and 3
on the top right hand corner of the figure. In this example,
the amino acid secondary structure is assumed to consist of the
following strands: $\{1\sim5\}_1$, $\{11\sim15\}_2$, $\{19\sim23\}_3$, and $\{26\sim30\}_4$,
where the notation $\{i_1 \sim i_n\}_i$ indicates the specific indices of
the amino acid fragment comprising the segment of secondary
structure. The tilde indicates the set of amino acids that
occupy that fragment of the sequence. On the left-hand side is
the representation of the structure on a two dimensional (2D)
graph (3). Each row corresponds to an index i and each column
to an index j.

The β-strands 1, 2, and 3 are already decided in the
structure. However, one must choose how to arrange the fourth
strand. If the coil region between the strands is very long,

then Zy will be large in Eqn (1) and the structure can fold to 

the left (depending on the composition of the β-strand at 1).
On the other hand, if the sequence length is short, then the
structure is likely to fold to the right. The possibilities
are depicted in the triangle graph by the yellow arrows. The
topological constraint function ($f_{i'j'}(\xi)$) will force β-strand 4
to link with strand 3 if the adjoining sequence is too short
to permit crossing over to strand 1. In this case, stretching
of the Gamma pdf ($r > r_0$) will result in a large positive entropy
as well.

Protein folding can be largely limited to short range
binding covering at most 10 secondary structure elements before
settling into a structure. This is a unique part of our invention.
The CLE sets a limit on domain size due to the heavy weight of
entropy loss. Topological restrictions such as the length of
the loop region further restrict the allowed structures,
eliminating pointless combinatorial alignments when a
connecting strand is shorter than a specified length.

This can be extended to other structures. In Figure 6
(parallel β-sheets), the structure involves parallel β-strands:
$\{1\sim5\}_1$, $\{16\sim20\}_2$ and $\{31\sim35\}_3$. The graph is different because
of the way the sequences align. An α-helical structure can also
be evaluated in this way (Figure 7) with the helical segments
expressed by the red bars, and their spatial arrangement
expressed by the dotted lines running diagonally on the triangle.
The topological constraint function ($f_{i'j'}(\xi)$) plays a primary
role with parallel β-sheets because the amino-acid sequence
must loop back around on itself. If the adjoining sequence
lacks a sufficient number of residues, the loop-back-around
would tend to bend (stress) the amino acid fragment excessively.

The form of plotting used in Figure 5 helps in
understanding the output and summarizes the main part of the
data storage as vectors. Each structure contains information
about the orientation of residues either to the side or above
and below. The folder looks for the best assembly of these short
fragments using the CLE and the local thermodynamic potentials
discussed above.

A qualitative description of protein folding based on CLE model

Here we provide a qualitative description of how the
cross-linking entropy affects protein folding according to Eqns
(1) through (3). Because proteins tend to fold in a sequential
manner (from N to C terminus), one must visualize this process
as dynamic rather than static. Nevertheless, it should be
remembered that the folding of a complex structure requires at
least as much sequence as the shortest loop of the domain. The
entropy loss should be minimized and this usually favors short
loops according to Eqn (1). Only a very special type of amino
acid sequence can produce a protein where a segment at its
N-terminus actually waits while the rest of the sequence folds and
then hooks on as a β-sheet at the C-terminus.

The hydrophobic effect is a strongly local phenomenon.
Secondary structure may form with dominantly extended-like
features (70) or even with much of the basic secondary structure
intact (5), but attraction between two
β-strands is only possible when the two strands are in proximity
to each other.

Qualitatively, the CLE helps discriminate between
different folding processes. In Fig. 8, we illustrate this by
showing the formation of three different folds: (a) an α-helix,
(b) a pleated β-sheet and (c) a Greek key fold.

The cross-links between residues in an α-helix are at
every fourth residue so $\Delta G_{ij}$ (Eqn (1)) will be essentially a
constant for all α-helix contacts and the formation rate will
depend strictly on the local characteristics of the residues.
This means that folding should be approximately
simultaneous over the length of the α-helix in a refolding
experiment.

The folding of each loop in the β-sheet of Fig. 8b proceeds
independently of the other loops, but $\Delta G_{ij}$ depends on $\Delta N_{ij}$ (Eqn
(1)). Proximity is a critical aspect of protein folding rates
for β-sheets where the residues i and j that are closest in
sequence space ($\Delta N_{ij}$ small implies $\Delta G_{ij}$ small) are also closest
in physical space and therefore are more likely to combine
sooner. Because the loops are of similar length in Fig. 8b, their
approximate rates of formation will be nearly equal in a
refolding experiment and for the most part can be considered
simultaneous. However, whereas the first fold of Fig. 8c is
depicted with the same rate ($k_2$) as Fig. 8b, the second rate ($k_3$)
is much slower and will usually occur long after the first fold has
formed.

It is known that the rate of folding is much different
between α-helices and β-sheets (74) where the rate for α-helices
is much faster. This is used in the method developed in the CLE
model.

Assuming the folding proceeds directly from the denatured
state to the native state with no significant kinetic traps
along the reaction coordinate, the rate of folding can be
expressed as

(4)

where h is the Planck constant and $k_B T/h$ expresses the vibrational
energy of the denatured state (9). We express the cross-link
between i and j as (i,j), and we rewrite Eqn (2) in terms of a
group of cross-links encompassing a persistence length ξ, and
the corresponding indices i and j in terms of the effective
indices $\bar{i}$ and $\bar{j}$ such that $(\bar{i},\bar{j})$. Assuming only
anti-parallel β-sheets are involved, we can ignore the
topological constraint contributions which are mostly
necessary in computing parallel β-sheets. Hence, $f_{i'j'}(\xi) = 0$, and
the FE becomes

(5)

where

(6)

and $\Delta G_{ij}$ and $\Delta G^{\circ}_{\xi}$ are defined in Eqns (1) through (3).

Qualitatively, the structure in Fig. 8c can be
effectively grouped into four independent β-strands forming two
β-sheets: $\Delta G_{cle} = \Delta G^{\circ}_{\xi} + (\Delta G)_2 + (\Delta G)_3$, where $(\Delta G)_2$ corresponds to the
entropy loss that occurs upon formation of the hairpin β-sheet
structure (Fig 8b, the structure associated with the rate $k_2$),
and $(\Delta G)_3$ corresponds to the entropy loss to form the second
parallel β-sheet associated with the rate $k_3$ (Fig 8c). Let the
total FE for the protein be defined as $\Delta G_{total} = \Delta G_{local} + \Delta G_{es} + \Delta G_{cle}$,
where $\Delta G_{local}$ includes such contributions as the solvation FE,
the hydrophobic interactions, and other protein specific local
interactions, and $\Delta G_{es}$ contains any long-range electrostatic
contributions to the FE. Assuming a reversible reaction, all
such thermodynamic potentials only depend on the initial and
final states of the system. Moreover, only $\Delta G_{cle}$ is a global
quantity; the remaining FE only depends on local short-range
interactions (hydrophobicity) or is damped out by the
formation of salt bridges (oppositely charged rotors). Hence,
for any specified structure the thermodynamic contributions
from $\Delta G_{local} + \Delta G_{es}$ can be estimated from the change in the initial
and final topology. The long-range contribution of $\Delta G_{es}$ is
generally thought to be small due to solvent effects. Hence,
the electrostatic contributions also appear to be largely
short-range and small. Likewise, $\Delta G^{\circ}_{\xi}$ can be estimated from the
flexibility of the structure (30) and, for fixed ξ, can be
estimated from Eqn (3). The rate determining contributions in
this model are therefore $(\Delta G)_2$ and $(\Delta G)_3$. If we further assume that
these local interactions are the same between the β-strands,
the contribution of these interactions will be shared almost
equally between the residues and
$\Delta G_{total} = 2(\Delta G'_{local} + \Delta G'_{es}) + \Delta G_{cle} = 2\Delta G' + \Delta G_{cle}$, where the primes indicate
that the local and electrostatic interactions of the individual
cross-links are approximated.

Since these contributions are separable from the rest
of the expression in $\Delta G_{total}$, we can estimate the protein-folding
rate to be



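$$
\begin{aligned}
k &= \frac{k_B T}{h}\,
\exp\!\left(-\frac{\Delta G_{local} + \Delta G_{es} + \Delta G^{\circ}}{RT}\right)
\exp\!\left(-\frac{(\Delta G)_2}{RT}\right)
\exp\!\left(-\frac{(\Delta G)_3}{RT}\right) \\
&= \frac{k_B T}{h}\left[\exp\{-(A_0 + 2A)/RT\}\right]
\prod_{\mathrm{all}(\tau_j)} W_{\tau_j}
= C B^2 \prod_{\mathrm{all}(\tau_j)} W_{\tau_j}
\end{aligned}
\qquad (7)
$$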



where $A = \Delta G'_{local} + \Delta G'_{es}$ is the contribution from the local
interaction and is a negative quantity for any spontaneous
reaction, $A_0 = \Delta G^{\circ}$, $C = \exp(-A_0/RT)$ (with the prefactor
$k_B T/h$ absorbed into $C$), $B = \exp(-A/RT)$,
$W_{\tau_j} = \exp(-(\Delta G)_{\tau_j}/RT)$, and $\tau_j$ corresponds to the index 2 or 3
in this particular example. From Eqn (7), $k_2 = CBW_2$ and $k_3 = CBW_3$.
Now, given that the binding interactions for formation of 2 and
3 are identical, $(\Delta G)_{\tau_j}$ is a positive, increasing function of
$\Delta N_{\tau_j}$, and $(\Delta G)_3 > (\Delta G)_2 > 0$ in Eqn (7). Therefore, $k_2 > k_3$ (Fig 8c)
because $k_3$ corresponds to the longest chain ($\Delta N_2 < \Delta N_3$).
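The ordering $k_2 > k_3$ can be checked numerically with a minimal sketch. It assumes the reading $k = C B W$ with $C = (k_B T/h)\exp(-A_0/RT)$ and $B = \exp(-A/RT)$; the function names and all free-energy values below are hypothetical and chosen only for illustration, not taken from the text.

```python
import math

R = 8.314462618e-3   # gas constant in kJ/(mol K)
K_B = 1.380649e-23   # Boltzmann constant in J/K
H = 6.62607015e-34   # Planck constant in J s


def weight(dg, temperature):
    """Boltzmann weight exp(-dG/RT) for a free energy dg given in kJ/mol."""
    return math.exp(-dg / (R * temperature))


def step_rate(a0, a, dg_loop, temperature=298.15):
    """Rate k = C * B * W for forming one cross-link.

    Assumption: C carries the prefactor k_B*T/h, so that
    C = (k_B*T/h) * exp(-A0/RT), B = exp(-A/RT), W = exp(-dG_loop/RT).
    """
    c = (K_B * temperature / H) * weight(a0, temperature)
    b = weight(a, temperature)
    return c * b * weight(dg_loop, temperature)


# Hypothetical inputs in kJ/mol: A < 0 for a spontaneous local interaction,
# and the entropy loss grows with loop length, so DG3 > DG2 > 0.
A0, A = 10.0, -4.0
DG2, DG3 = 5.0, 8.0

k2 = step_rate(A0, A, DG2)
k3 = step_rate(A0, A, DG3)
print(f"k2 = {k2:.3e} 1/s, k3 = {k3:.3e} 1/s, k2/k3 = {k2 / k3:.2f}")
```

Because $C$ and $B$ are common to both steps, the ratio $k_2/k_3$ reduces to $\exp\{[(\Delta G)_3 - (\Delta G)_2]/RT\}$, so the ordering of the rates follows from the ordering of the entropy losses alone.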
The CLE provides a quantitative physical model with
solutions that resemble the contact order model (75), but more
than that, the CLE shows why the contact order is a reasonable
heuristic in protein and RNA folding. The CLE also predicts that
α-helices can fold to any conceivable length (76), and that
they fold cooperatively because the contact points and their
relative distance along the protein chain are uniform (Fig. 8a).
For pleated β-sheets (Fig. 8b), the folding rate of each loop
depends on $W_2$, and β-sheet formation shows a distance-dependent
rate that is usually slower than that of the α-helix (74): $k_2 < k_1$ (Fig.
8a and b). For the β-meander (Fig. 8b), the folding rate for
strands of equal length will be of the order of $W_2$, and such
identically shaped pleated β-sheet structures will appear to
fold cooperatively on a time scale that is proportional to
$1/W_2$. In forming structures like a Greek key, the folding rate
depends on the product $W_2 W_3$. This gives a much longer folding
time than for the α-helix (Fig. 8a) or the pleated β-sheet (Fig.
8b) structures ($k_3 \ll k_2$). Further, because this structure has
a large entropy loss, few such structures are likely to be found
in nature. The Greek key and the jellyroll fold, although
existing in nature, are far less common than the pleated
β-sheet.
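The time-scale argument can be illustrated with a short sketch that compares a structure whose rate scales with $W_2$ against one whose rate scales with the product $W_2 W_3$. The helper names and the entropy-loss values are hypothetical; only the scaling of the time with the inverse product of the weights is taken from the text.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol K)


def boltzmann_weight(dg, temperature=298.15):
    """W = exp(-dG/RT) for an entropy-loss free energy dg in kJ/mol."""
    return math.exp(-dg / (R * temperature))


def relative_folding_time(entropy_losses, temperature=298.15):
    """Time scale proportional to 1 / prod(W) over all cross-links.

    The common prefactor is omitted, so only ratios of the returned
    values are meaningful.
    """
    product = math.prod(boltzmann_weight(dg, temperature) for dg in entropy_losses)
    return 1.0 / product


# Hypothetical entropy losses in kJ/mol for the two loop types.
DG2, DG3 = 5.0, 8.0

t_hairpin = relative_folding_time([DG2])         # scales as 1/W_2
t_greek_key = relative_folding_time([DG2, DG3])  # scales as 1/(W_2 * W_3)
print(f"Greek key / hairpin time ratio: {t_greek_key / t_hairpin:.1f}")
```

In this sketch the ratio of the two time scales equals $1/W_3$, so every additional long-range loop multiplies the folding time by the inverse of its own weight.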

Effect of the invention

Finding the topology of a protein is an essential 
intermediate step between the simple three-state secondary 
structure prediction and the final 3D structure of a protein. 

Secondary structure in proteins refers only to whether
the amino acid sequence has α-helices, β-sheets or coil. Some
programs can also indicate whether there is a high probability
of a turn. Since there are $2^{n-1} n!/2$ possible ways to arrange this
secondary structure (where n is the number of secondary
structures), merely assuming all configurations are allowed
yields an astronomically large number of structures to test for
any real sequence. Testing them all requires considerable
resources. Therefore, a powerful approach that can weed out the
useless solutions is needed to drastically reduce the number of
unnecessary ways of arranging the secondary structure. The
invention is well suited to assist research in
this direction.
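The growth of this search space can be made concrete with a small sketch that evaluates $2^{n-1} n!/2$ for a few values of n. The function name is hypothetical, and the restriction to n of at least 2 is an assumption made here so that the count is a whole number.

```python
import math


def arrangement_count(n):
    """Number of arrangements 2**(n-1) * n! / 2 of n secondary structures.

    Assumes n >= 2, for which n! is even and the division is exact.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    return 2 ** (n - 1) * math.factorial(n) // 2


for n in (2, 3, 5, 10, 15):
    print(f"n = {n:2d}: {arrangement_count(n):,} arrangements")
```

Even a modest protein with ten secondary-structure elements already gives close to a billion candidate arrangements, which is why pruning unfavorable topologies early matters.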

This algorithm is expected to aid the protein structure 
researcher to find the topology of a protein quickly and 
effectively. For RNA structure calculations, we have already
developed an approach that significantly improves the
performance of these calculations over existing standards (30).
We have applied it to protein structure calculations here
because of the universality of this theory of biopolymer folding.
To account for the differences between proteins and RNA, a
different parameterization and a different folding approach were
used.

The invention also provides insight into the dominant
folding pathway of a protein and can be helpful in understanding
the effects of mutations on the topological stability of a
protein.